The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science

  • Downloads: 8698
  • Type: Epub+TXT+PDF+Mobi
  • Create Date: 2021-07-18 09:52:46
  • Update Date: 2025-09-06
  • Status: finished
  • Author: Alex Gorelik
  • ISBN: 1491931558
  • Environment: PC/Android/iPhone/iPad/Kindle

Summary

The data lake is a daring new approach for harnessing the power of big data technology and providing convenient self-service capabilities. But is it right for your company? This book is based on discussions with practitioners and executives from more than a hundred organizations, ranging from data-driven companies such as Google, LinkedIn, and Facebook, to governments and traditional corporate enterprises. You'll learn what a data lake is, why enterprises need one, and how to build one successfully with the best practices in this book.

Alex Gorelik, CTO and founder of Waterline Data, explains why old systems and processes can no longer support data needs in the enterprise. Then, in a collection of essays about data lake implementation, you'll examine data lake initiatives, analytic projects, experiences, and best practices from data experts working in various industries.


  • Get a succinct introduction to data warehousing, big data, and data science
  • Learn various paths enterprises take to build a data lake
  • Explore how to build a self-service model and best practices for providing analysts access to the data
  • Use different methods for architecting your data lake
  • Discover ways to implement a data lake from experts in different industries

Reviews

Mirjam

Two stars. Pro 1: The book is comprehensive. Pro 2: Chapter 1 is almost great. If you can get this chapter for free on Amazon: do it and don't buy. Con 1: My brain is not in the cloud, but the writer underestimates my memory. A lot of literal repetition throughout the book. Con 2: Glorification of the contributors and data scientists in general. The sentence 'Michael Hausenblas is a long-term big data visionary [..]' almost made me vomit. Con 3: The parts about encryption, anonymization and pseudonymization, data stewards, and the scope of the GDPR were impractical if not incorrect. Made me suspicious about the rest. Con 4: It is a bit of a vendor book, and I suspect it is a bit skewed towards Informatica at the expense of Collibra, which is a very fine data governance tool (no shares on my part). Con 5: There are a lot of very competent women in the field. Did the writer forget to ask them for a contribution, or did they reject him?

Michael Lee

Excellent. Precisely the level of abstraction that people in strategic positions need, while not sacrificing the technical detail necessary to put the pieces together.

Marian

Out-of-date handbook for complete Big Data beginners. If you LITERALLY don’t know anything about Big Data, this book can be very informative; otherwise, not recommended.

M

Great book. Provides a high-level view of implementing a big data lake architecture in an enterprise.

Oleg Soroka

This book gives a good overview of the technology landscape and helps to align on terms and fundamental concepts.

Sebastian Gebski

2.2 stars. Pros: one of the very few books on the topic; interesting high-level patterns (e.g. splitting the data lake into zones, slowly changing dimensions); an important (yet not very informative) chapter on self-service; the book being tech-agnostic. Cons: the initial 20% is nearly stellar, but after that point (approx.) the author is out of content and the book gets very shallow (check the chapter about "architecting" ...); being 100% tech-agnostic in the end appears too limiting; very important topics like cataloging, tagging, and lineage tracing are just mentioned (w/o any practicalities); the chapter "industry-specific perspectives" is 100% garbage - pure banalities, zero value inside. What a disappointment :(

Bendystraw

Giving this four stars because it was genuinely helpful to me as a data practitioner from the height of the BI/Data Warehousing era and a data lake skeptic. My big takeaway was that data lakes are simply a set of tools and approaches to data analytics and applications: an iteration of the data warehouse, now that we better understand the DW’s limitations and have new, cheaper technology available to address them. A lot of my skepticism was rooted in vendors and thought leaders shilling Data Lakes™️ as an entirely new paradigm that will allow organizations to magically buy their way out of hard work like finding, cataloging, and normalizing data.

Emre Sevinç

I find the current technology landscape both a blessing and a curse for companies that aim to be as data-driven as possible. It's a blessing, because now that we have access to very flexible and powerful cloud computing systems, we can spin up virtually unlimited storage and computing power on demand, and then run state-of-the-art data analysis, machine learning and AI systems. But it's also a curse, because having such easy access to so many practical, flexible technology solutions, as well as powerful open source data systems, creates the illusion that you can auto-magically overcome the challenges of creating more value from a company's data assets. In principle, no company would object to being more data-driven, quickly employing flexible, smart automation solutions, and enhancing its business processes with machine learning and artificial intelligence systems. Unfortunately, many of those companies risk underestimating the importance of well-designed data management systems, processes, platforms and technologies, which, in turn, are inherently tied to data quality. Unless these topics are handled properly, it's very challenging and costly to build reliable predictive data analytics, machine learning, and AI systems whose raw materials are high-quality and well-managed data assets. One big concern is how to overcome such strategic challenges, in order to ensure that the ongoing data efforts will have the expected ROI.

This book not only provides striking examples of best practices for complex enterprise big data management in service of creating value out of data assets, but also draws attention to the pitfalls and risks involved. The book's focus on topics such as the importance of smart, automated and well-designed data catalogs, and the challenges in capturing the context, metadata, and data lineage surrounding Critical Data Elements, carries many lessons that apply to enterprises big and small alike. Many companies are also negatively impacted by the dark data in their data landscape.

I found the highlights below particularly important and relevant to my experience as a Data Officer in a very complex, international manufacturing environment:

“The vision is often to eventually get rid of the data warehouse to save costs and improve performance, since big data platforms are much less expensive and much more scalable than relational databases. However, just offloading the data warehouse does not give the analysts access to the raw data. Because the rigorous architecture and governance applied to the data warehouse are still maintained, the organization cannot address all the challenges of the data warehouse, such as long and expensive change cycles, complex transformations, and manual coding as the basis for all reports. Finally, the analysts often do not like moving from a finely tuned data warehouse with lightning-fast queries to a much less predictable big data platform, where huge batch queries may run faster than in a data warehouse but more typical smaller queries may take minutes.”

“Why is it so difficult to find data in the enterprise? Because the variety and complexity of the available data far exceeds human ability to remember it. Imagine a very small database, with only a hundred tables (some databases have thousands or even tens of thousands of tables, so this is truly a very small real-life database). Now imagine that each table has a hundred fields—a reasonable assumption for most databases, especially the analytical ones where data tends to be denormalized. That gives us 10,000 fields. How realistic is it for anyone to remember what 10,000 fields mean and which tables these fields are in, and then to keep track of them whenever using the data for something new? Now imagine an enterprise that has several thousand (or several hundred thousand) databases, most an order of magnitude bigger than our hypothetical 10,000-field database. I once worked with a small bank that only had 5,000 employees, but managed to create 13,000 databases. I can only imagine how many a large bank with hundreds of thousands of employees might have. The reason I say “only imagine” is because none of the hundreds of large enterprises that I have worked with over my 30-year career were able to tell me how many databases they had—much less how many tables or fields. Hopefully, this gives you some idea of the challenge analysts face when looking for data.”

“Ideally, the analysts should be able to request access to the data they need. However, if they cannot find the data without having access to it, we have a catch-22.”

“In most enterprises the knowledge about where data is, which data sets to use for what, and what data means is locked in people’s heads—this is commonly referred to as ‘tribal knowledge’.”

“Without a Data Catalog, in order to find a data set to use for a specific problem, analysts have to ask around until they find someone—if they’re lucky, a subject matter expert (SME)—who can point them to the right data. SMEs can be difficult to find, though, so the analyst may instead run into someone who tells them about a data set that they used for a similar problem, and will then use that data set without really understanding what was done to it or where it came from.”

The solution is a more agile approach to access control that some enterprises are beginning to adopt. They create metadata catalogs that allow the analysts to find any data set without having access to it. Once the right data sets have been identified, the analysts request access to them and the data steward or data owner decides whether to grant access, for how long, and for which portions of the data. Once the access period expires, the access can be automatically revoked or an extension requested.

“It is much easier to document data sets when they are first created, because the information is fresh. Nevertheless, even at Google, while some popular data sets are well documented, there is still a vast amount of dark or undocumented data.”

“In traditional enterprises, the situation is much worse. There are millions of existing data sets (files and tables) that will never get documented by analysts unless they are used—but they will never be found and used unless they are documented. The only practical solution is to combine crowdsourcing with automation.”

“More modern data catalogs—especially catalogs with automated tagging—allow data quality specialists and data stewards to define and apply data quality rules for a specific tag.”

“For example, a numeric field with three-digit numbers ranging from 000 to 999 is very likely to be a credit card verification code if it is found next to a credit card number, but a field with exactly the same data is very unlikely to be a credit card verification code if found in a table where all the other tags refer to medical procedure properties.”

“To solve the completeness problem, create a data catalog of all the data assets, so the analysts can find and request any data set that is available in the enterprise.”
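The context-aware tagging heuristic quoted above (same three-digit data, different conclusion depending on neighboring columns) can be sketched in a few lines. This is a minimal illustration, not the book's or any catalog product's implementation; the tag names, the CVV pattern, and the decision rule are all assumptions made for the example.

```python
import re

# Hypothetical tag vocabularies; the book does not name specific tags.
PAYMENT_TAGS = {"credit_card_number", "card_expiry", "cardholder_name"}
MEDICAL_TAGS = {"procedure_code", "procedure_name", "procedure_cost"}

def looks_like_cvv(values):
    """A field of three-digit strings (000-999) matches the CVV shape."""
    return all(re.fullmatch(r"\d{3}", v) for v in values)

def infer_cvv_tag(values, neighbor_tags):
    """Tag a field as a CVV only when its context supports it:
    identical data, different neighbors -> different conclusion."""
    if not looks_like_cvv(values):
        return None
    neighbors = set(neighbor_tags)
    if neighbors & PAYMENT_TAGS:
        return "cvv"   # three digits next to a card number: likely a CVV
    return None        # same digits among, e.g., medical columns: leave untagged

# Same column of values, two different tables:
vals = ["123", "045", "999"]
print(infer_cvv_tag(vals, ["credit_card_number", "card_expiry"]))  # cvv
print(infer_cvv_tag(vals, ["procedure_code", "procedure_cost"]))   # None
```

Real automated-tagging engines weigh many more signals (column names, value distributions, lineage), but the core idea is the same: classification of a field depends on the table it sits in, not just on the field's own values.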
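The agile access-control flow the reviewer describes (find any data set via the catalog, request access, receive a time-limited grant, automatic revocation on expiry) could be sketched as follows. The class and method names are illustrative assumptions, not an API from the book.

```python
from datetime import datetime, timedelta

class AccessGrants:
    """Minimal sketch of time-limited data-set access grants: analysts can
    always *find* a data set in the catalog, but can only *read* it while
    an unexpired grant exists."""

    def __init__(self):
        self._grants = {}  # (analyst, dataset) -> expiry datetime

    def grant(self, analyst, dataset, days=30):
        # Data steward or data owner approves access for a limited period.
        self._grants[(analyst, dataset)] = datetime.now() + timedelta(days=days)

    def extend(self, analyst, dataset, days=30):
        # An extension simply moves the expiry forward from now.
        if (analyst, dataset) in self._grants:
            self._grants[(analyst, dataset)] = datetime.now() + timedelta(days=days)

    def can_read(self, analyst, dataset):
        # An expired grant behaves exactly like a revoked one.
        expiry = self._grants.get((analyst, dataset))
        return expiry is not None and expiry > datetime.now()

grants = AccessGrants()
grants.grant("ana", "sales.orders", days=7)
print(grants.can_read("ana", "sales.orders"))  # True
print(grants.can_read("ana", "hr.salaries"))   # False
```

A production system would also scope grants to portions of the data and audit every decision, but expiry-as-revocation is the piece that resolves the catch-22 the reviewer quotes: discovery never requires access, and access never outlives its approval.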